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ABSTRACT 

We present a novel technique, dubbed FiEstAS, to estimate the underlying density 
field from a discrete set of sample points in an arbitrary multidimensional space. 
FiEstAS assigns a volume to each point by means of a binary tree. Density is then 
computed by integrating over an adaptive kernel. 

As a first test, we construct several Monte Carlo realizations of a Hernquist profile 
and recover the particle density in both real and phase space. At a given point, Poisson 
noise causes the unsmoothed estimates to fluctuate by a factor ~ 2 regardless of the 
number of particles. This spread can be reduced to about 1 dex (~ 26 per cent) by our 
smoothing procedure. The density range over which the estimates are unbiased widens 
as the particle number increases. Our tests show that real-space densities obtained 
with an SPH kernel are significantly more biased than those yielded by FiEstAS. In 
phase space, about ten times more particles are required in order to achieve a similar 
accuracy. 

As a second application we have estimated phase-space de nsities in a dark matter 
halo from a cosmological simulation. We confirm the results of lArad ~ al. (2004) that 
the highest values of / are all associated with substructure rather than the main halo, 
and that the volume function v{f) ~ f~^'^ over about four orders of magnitude in /. 
We show that a modified version of the toy model proposed by Arad et al. explains 
this result and suggests that the departures of v{f) from power-law form are not mere 
numerical artefacts. We conclude that our algorithm accurately measure the phase- 
space density up to the limit where discreteness effects render the simulation itself 
unreliable. Computationally, FiEstAS is orders of magnitude faster than the method 
based on Delaunay tessellation that Arad et al. employed, making it practicable to 
recover smoothed density estimates for sets of 10^ points in 6 dimensions. 

Key words: methods: data analysis - methods: numerical - dark matter - galaxies: 
haloes - galaxies: kinematics and dynamics 



1 INTRODUCTION 

The estimation of a continuous density field from a set of 
discrete sample points is a common problem. For example, 
the estimation of the matter density in real space is funda- 
mental in numerical studies of cosmic structure formation, 
and lies at the heart of N-body codes that solv e the Poisson 
equation, such as A RT llKravtsov et alJll997l) or MLAPM 
jKnebeeTaDllQO^). It also plays an important role in hy- 
drodynamic codes based on Smooth Particle Hydrodynam- 
ics (SPH) (see e.g. iMonaghaiil Il992l : ISpringel fc Hernauisd 
[2002,, and references therein). 

Several problems of current interest involve the estima- 
tion of the density of p oints in phase space (e.g., iBinnevI 
I2OO4I: I Arad et aflliooi^ . Although violent relaxation and 
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phase-mixing tend to disrupt substructures, both stars and 
dark matter may retain information about their initial con- 
ditions for a relatively long time. In principle, it would be 
possible to trace the merging history of our galaxy by look- 
in g for p hase-space streams in the stellar population (see e.g. 
iHelmi fc'Whitelll999i) . These streams might be identified as 
regions of enhanc ed density in either real space, phase space, 
or integral space jHelmi et al.ll20()3) . 

Most algorithms for estimating densities require the im- 
position of a metric on the space occupied by sample points 
- that is, they assume that the 'distance' between every 
two points is defined. However, often the underlying phys- 
ical problem does not have any metric. In such a case, a 
metric can be imposed by specifying a matrix Kij that re- 
lates one unit along the i-th axis to one unit on the j-th 
axis. For example, in the case of phase space, we need a 
dimensional constant K to relate positions and velocities, 



2 Y. Ascasibar and J. Binney 



so that a velocity difference v is equivalent to a spatial off- 
set of Kv. The local velocity dispersion a would thus define 
a length scale Ka that will in general bear no relation to 
the actual length scale I. For example, in a singular isother- 
mal sphere, I oc r while Ka is constant. This effect will be 
present in many applications, amongst them the estimation 
of phase-space densities within dark matter haloes. 

In this paper we present an algorithm for density esti- 
mation that does not require a metric. We show that our 
method provides fast and accurate estimates of both real- 
space and phase-space densities. The algorithm's indepen- 
dence of the existence of a metric makes i t applicable to a 
wide class of problems in statistics (see e.g. ISilvermanlll986l 
for a review), in which one wishes to find clusters of data 
points in a space in which different axes represent quanti- 
ties with different physical dimensions. For example, a data 
point could be composed of a position, a redshift and lu- 
minosities in several bands, and one wishes to find group- 
ings of physically related points. Some automated classifica- 
tion algorithms based on probability densiti es are currently 
being applied to large galaxy surveys (e.g. iRichards et alJ 
120041) that could tremendously benefit from the possibility 
of working in a completely arbitrary space. 

When a metric is available for the sample space, esti- 
mator s based on kernels are widely used. For example, in 
SPH jLucvlll977l: iGineold fc Monaghanlll97'if) one obtains 
the density from 

JV 

/5sPH(r) = ^ mpW^(r - Tp, ft), (1) 

p=i 

where rrip is the mass of each particle, Vp its position, and 
W{r,h) is a kernel characterized by a smoothing scale, h. 
Adaptive resolution is achieved by setting h to the distance 
to the A^n-th nearest neighbour (Jiernauist & Katz..l989i,V 

A promising alternative to equa tion Voronoi tes- 
sellations (see e.g. lOkabe et al.lll993) have been used in as- 
trophysics to identify ov erdensities in a Poissonian di stri- 
bution of X-ray photons fEbeling fc Wieden mannlll993ll , as 
well as detecting galaxy clusters in observational surveys 
(e.g. iRamella et alJ l200ll : iKim et alJ l2002l : iMarinoni etal\ 
|20o3)- Similar in s pirit, the Delaunay T essellation Field 
Estimator (DTFE, iPelupessv et al.l l2003l and references 
therein) estimates the densities of a set of points from 
the volume of the Delaunay cells they belong to. A 
continuous field can be obtained by l inear interpolation 
iBernardeau Sz van de WevgaertlliggeTl . lArad et alJ ll2004D 
have recently applied the DTFE method to the estimation 
of six-dimensional phase-space densities in N-body exper- 
iments. They find that the volume distribution function, 
v{f), for relaxed haloes follows a power law over more than 
4 decades in /. For high values of the phase-space density, 
v{f) is dominated by 'cold' substructure rather than by the 
parent halo. 

However, the definition of distance underlying a Voronoi 
or Delaunay tessellation is not at all obvious in phase space. 
The chosen metric determines which particles are close 
neighbours, and thus the resulting tessellation and the den- 
sities that are assigned to each particle. The impact of such 
an ad-hoc choice is extremely difficult to quantify, and it 
would require a problem-by-problem study. It is easy to un- 
derstand, though, that one's choice may be of crucial impor- 



tance in the general case, in which very different scales (e.g. 
a halo with substructure) or obviously non-Euclidean spaces 
(e.g. number density of star-forming galaxies per stereora- 
dian per redshift per morphological type per luminosity) are 
invloved. On the other hand, we would like the results to 
be as stable as possible if we decided to work in different 
units (e.g. comoving distance instead of redshift, or abso- 
lute magnitude instead of luminosity) . No linear scaling Kij 
could accommodate such a transformation. The choice of 
units and coordinate types (e.g. radial, polar, logarithmic) 
is critical in metric-based schemes. 

We present here a non-metric Field Estimator for Arbi- 
trary Spaces (FiEstAS), thoroughly described in Section|5| 
Tests of the algorithm are presented in Section |3 where we 
attempt to recover both the r eal- and phase-sp ace density 
from random realizations of a iHernauistI il990t) profile, for 
which an analytical distribution function is known. We in- 
vestigate the phase-space structure of a realistic dark matter 
halo from an N-body simulation in Section 2] and discuss 
the computational performance of our method in Section |5| 
Section |n] summarizes our main conclusions. 



2 THE ALGORITHM 
2.1 Tessellation 

Like the DTFE method, FiEstAS is based on a tessella- 
tion of the d-dimensional space, i.e. a division of J?'' in into 
mutually disjoint polygons. However, we do not resort to a 
Delaunay or Voronoi tessellation. Although these can adapt 
very efficiently to the geometry of the problem, they require 
a metric to define distances. Moreover, they become compu- 
tationally expensive for large datasets, particularly when d 
is large. 

Instead, we use a binary tree to tessellate our hyper- 
volume by means of a recursive procedure. Given a set of 
A'^ points, we split the space along the i-th coordinate. In 
the ideal case, one would classify the points according to 
their position relative to the median Xi, thus obtaining two 
partitions with N/2 points each. In practice, we select the 
two closest points (one from each side) to the mean (xi) 
and split the i-th axis at a::spiit = ix+ + X-)/2. This is suf- 
ficiently accurate for our purposes, and much more efficient 
than computing the median. 

The process is repeated for both subsets until the initial 
space is divided in N hyper-boxes, each containing a single 
point. A first estimate of the local density is just the weight 
of the point, which we shall call its mass rup, divided by the 
volume of its hyper-box. 



The construction of a two-dimensional binary tree is il- 
lustrated in Fig. Q In the general case, the order in which 
the coordinates are selected for splitting should be chosen at 
random. Nevertheless, we try to respect the symmetries of 
our particular problem (the estimation of phase-space densi- 
ties) as much as possible, so we alternately split a real-space 
and a velocity-space coordinate. In both cases, we always 
select the axis with the highest elongation, {xfj — (xi)'^ . 
Proceeding in this way does not significantly affect the es- 
timated densities, but it alleviates numerical problems that 
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Figure 1. Construction of a two-dimensional binary tree. The 
first division occurs along the horizontal axis, and x^put is set 
between points 5 and 14. The left part is then divided along the 
vertical axis (between points 5 and 3), and so on. 



arise when two points have very similar (or equal) values of 
a coordinate Xi. It also yields cells that are more or less cu- 
bic when projected onto the position or velocity sub-spaces, 
which both have Euclidean metrics. However, we do not im- 
pose any relation between these sub-spaces, nor do we at- 
tempt to compute any 'distance' involving more than one 
coordinate at a time. 

As discussed above, Voronoi or Delaunay tessellations 
are not well defined in spaces lacking a metric. Our scheme 
is not be sensitive to the particular choice of units, the char- 
acteristic scale (if any) or the nature of each dimension, al- 
though the precise shape of our orthohedral cells depends 
on the choice of principal axes (for those sub-spaces admit- 
ting rotations) as well as on the splitting sequence. None the 
less, the uncertainity due to this non-uniqueness is compa- 
rable to or lower than the intrinsic Poisson noise of the point 
distribution. 



2.2 Boundary effects 

A problem common to all tessellations (with Delaunay trian- 
gulation probably yielding the smallest errors) is the choice 
of a boundary for the system. Since the density field can be 
completely arbitrary, there is no reason to expect that the 
point distribution under study should fit into an orthohedral 
bounding box. 

In particular, there is no warranty that the field will be 
well sampled near the boundaries. This is clearly seen, for ex- 
ample, if we take a finite distribution, fully contained within 
the bounding hyper-box, and enlarge the box without alter- 
ing the data points themselves. Most algorithms, including 
ours, will be stable in the innermost regions, where every 
point is surrounded by neighbours (i.e. the field is well sam- 
pled), but densities in the outskirts will be lower because 
each point would be associated with a larger volume. This 
phenomenon is not an issue when we are interested in a small 
subset of the point distribution or when periodic boundary 
conditions apply. Then, boundary effects are hidden within 
the Poisson noise. However, we expect to encounter this diffi- 
culty fairly often, since it arises in any distribution sampling 
an infinite space, specially when there is a sharp density gra- 
dient in the outermost regions. In the estimation of phase- 
space densities, it manifests most acutely for particles in the 
tails of the velocity distribution. 

We try to compensate this effect by redefining the 



boundary of the system. Consider, for example, a sample 
point that lies at x'^ within a box that extends to the cur- 
rent boundary of the sampled region, in a portion of the 
hyperplane Xi = Xmax- Let the opposite (interior) face of 
the point's volume be defined by the hyperplane Xi = Xmin. 
Then we bring the outer face of the box from Xmax into the 
hyperplane Xi = x'^ + (a;^ — Xmin). This redefinition of the 
outer face of the box reduces its volume and thus increases 
the derived density. The changes are small if the field is well 
sampled near the boundaries, because then the point will be, 
on average, in the middle of the original box. If the field is 
severely undersampled, x^ will lie much nearer to Xmin than 
Xma,x and the change will be more significant. The correc- 
tion is a crude attempt to adapt the shape of the boundary 
of the volume within which densities are determined to the 
region of space that is usefully sampled. 



2.3 Smoothing 

The prescription outlined above provides a very fast esti- 
mation of the density. It has the additional advantage that 
volume integrals of the type 



7= / *[p(a;)] d^a; 
Jv 

are readily computed as 



X {VpllV). 



(3) 



(4) 



However, the estimator Q is prone to a large statistical er- 
ror. Even worse, the associated uncertainty does not vanish, 
no ma tter how many points we use. As noted by Arad et alj 
i200'tfl . the probability distribution p{p/p) does not ap- 
proach a Dirac delta function, even in the limit — > oo. 
This problem is common to all methods based on a fixed 
number of points (in our case, just one). At constant den- 
sity, increasing the total number of points yields smaller and 
smaller volumes. Since the process is scale-invariant, both 
the probability distribution p{p/ p) and the relative error 
^pI p remain constant. 

Ideally one has just enough particles that the density 
changes by only a small amount between neighbouring par- 
ticles. Increasing the resolution beyond this ideal does not 
of itself reduce the error in the measured density, but it does 
make it possible to reduce the statistical error by smoothing 
our density estimate J^J. FiEstAS implements a kernel in- 
terpolation, similar in spirit to the SPH scheme. For a given 
point X in d-dimensional space, we integrate the mass over 
a region V around x. First, the mass of each data point is 
uniformly distributed within its assigned volume, Vp. Then, 
using equation I^J we calculate 



M(x,V) =^/5j, X (%r\V), 



p=i 



and we set 



p{x,V) 



M{x,V) 
V 



(5) 



(6) 



The shape of V is chosen to be orthohedral, both because of 
computational efficiency and because we do not assume the 



4 Y. Ascasibar and J. Binney 



Figure 2. Smoothing around the position of point 6 in Fig. Ill 
We start with the shadowed box, defined by the nearest faces of 
the cell. We double the box size (dashed lines) until M > Mj;, 
and then find the exact solution (thick solid line) by repeatedly 
halving the search interval (thin solid lines). Numbers at the top 
left corner of each box illustrate this sequence. 



existence of a metric for the space^.We adjust the volume 
V as described below such that M{x,V) = Mk, where the 
kernel mass Mk is the only free parameter in FiEstAS. 
Basing on our preliminary tests, we suggest Mk — lOmp as 
a compromise between accuracy and spatial resolution. 

The boundaries of the volume V are the hyperplanes 
on which the i-th coordinate equals Xi ± Axi, where the 
half- width Axi is determined iteratively (see Fig. |2J. We 
start with Axi equal to the minimum distance along the 
i-th axis to an edge of the hyper-box containing x. With 
this prescription, M 5C nip. We then double all the Axi 
until M > Mk, and then find the exact value by repeatedly 
halving the search interval. 



3 HERNQUIST PROFILE 

In order to test our algorithm, random realizations of a 
lHernaui's3 (Il990l) profile have been generated with different 
numbers of particles, N. We then try to reconstruct the den- 
sity field both in real and phase space, assessing the accuracy 
of our estimation, as well as the importance of smoothing 
and correcting for boundary effects. 

The density in real space is given by 



p{r) = 



M 



1 



2na^ r/a(l + r/a)^ ' 
which leads to a cumulative mass 



M(r) = M 



r/a 



1 +r/a^ 
and a gravitational potential 
GM 1 



$(r-) = 



a 1 + r/a 



(7) 



(8) 



(9) 



^ It would be hard to define any other shape (e.g. a hyper-sphere) 
without specifying a metric. 



The main advantage of the Ifernquist profile, which 
greatly simplifies our analysis, is that the phase-space den- 
sity can be evaluated analytically as a function of energy, 

M/a^ 



47r3(2GA//a)^''^ 



3 sin-i q + q^l-q^{l - 2g2)(8g^ - Sg^ - 3) 

" {T^W^- ' 

where M is the total mass, a is a scale length and 



E 



GUI a (11) 

The lifernauisti lll990l) model has been widely used to 
model the mass distribution of dark matter haloes in nu- 
merical simulations, and it provides a good fit to the ob- 
served surface brightness of bulges and elliptical galaxies. 
Some other profiles have been propo sed in the literature (e.g. 
iNavarro et al.ll997l:lMoore et al.ll999ll that better fit the re- 
sults of cosmological N-body simulations, but they lack an 
analytical expression for the phase-space density. 

The actual generation of random N-body realisations is 
fairly straightforward once the mass profile and the phase- 
space density are known. The radial coordinate is chosen by 
generating a random number x between and 1, and then 
inverting equation ||5J, 



(12) 



a 1 — yfx 

Then, velocities are assigned from the distribution 
p(r) 



(13) 



using von Neumann rejection (see e.g. IPress et al.lll993) . 

We generate a tentative velocity «, uniformly distributed 
between and Wmax = -y^— 2<I>(r), and an auxiliary ran- 
dom number x between and ujjjax/l'^max). The velocity 
is accepted only if a; < j(y). Otherwise, two new random 
numbers are generated for v and x until a value is finally 
accepted. Angular coordinates for both positions and ve- 
locities are obtained from the variables < (/> < 27r and 
-1 < cosS < 1. 



3.1 Real space 

The real-space density obtained from two random realiza- 
tions with A'' — 100 and A — 10^ particles is compared 
in Fig. |3] with the theoretical profile O. We corrected for 
boundary effects and applied smoothing with — lOmp 
in both cases. The mean density and standard its deviation 
were computed in 30 logarithmic bins, subject to the con- 
straint that each bin should contain at least 2 particles. For 
N = 100, the profile is accurately recovered for about one 
decade in radius. In the central regions, lack of resolution 
erases the density cusp, while in the outskirts of the sys- 
tem, the sampling provided by 100 points is too coarse for 
a reliable reconstruction of the density field. 

This can be more clearly seen in Fig. 21 where we plot 
the ratio p/p with respect to the theoretical density p for 
different values of A. In all cases, the estimated density is 
most reliable in the intermediate range. There is some inner 
radius rmin, set by the number of particles, beyond which 
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Figure 3. Reconstruction of a Hernquist density profile from 
two random realizations with A'^ = 100 (left) and A'^ = 10^ (right) 
particles. Thin solid lines show equation Q. The average estima- 
tion given by FiEstAS is plotted as a thick solid line. Dashed and 
dotted lines represent one and three-sigma scatter, respectively. 
The density profile derived from particle counts in logarithmic 
radial bins is shown as open triangles. 
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Figure 4. Systematic dependence of p/p on p for different num- 
bers of particles in a Hernquist sphere. Boundary correction and 
smoothing have been applied in all cases. Line styles are the same 
as in Fig. |2l 

we are not able to resolve the cusp. On the other hand, 
there is some outer radius rmax, also set by A'^, where the 
density becomes so low that there are simply not enough 
points to sample the field, and the estimate p is dominated 
by boundary effects. 

Fig.|S]shows that the boundary correction is relevant for 
the unsmoothed estimator (bottom panels) at low densities, 
but it has a weaker effect on the smoothed estimate (top 
panels). For particles at the boundary, the volume assigned 
by the binary tree is usually too large, because the bounding 
box is a cuboid, whereas the density field has spherical sym- 
metry. There is a lot of empty space from the last particle 
to the appropriate faces of the bounding box, resulting in an 
artificially large value of Vp. The density given by expression 
^ is thus severely underestimated for these particles. When 
the smoothed estimator © is used instead, the effect is more 
limited, since each particle will typically be surrounded by 
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Figure 5. Systematic dependence of p/p on p for different pre- 
scriptions. Density has been smoothed on the top panels. Left 
panels apply the boundary correction. All plots correspond to a 
realization of a Hernquist sphere with A'^ = lO** particles. Line 
styles are the same as in Fig. |3| 
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Figure 6. Probability distribution p(/3/p) for a Hernquist sphere 
with A^ = 10'* particles, using different prescriptions. Correction 
for boundary effects is negligible, while smoothing considerably 
reduces the uncertainty. 

higher density neighbours. These will dominate the volume 
integral, minimising the contamination from artificially low- 
density particles. 

Moreover, the number of particles affected by the 
boundary correction constitutes a relatively small fraction 
of the total. As can be seen in Fig.|Sl the overall distribution 
of p/p is not altered at all by considering boundary effects, 
even for moderate resolutions (A'^ — 10'* particles) . This cor- 
rection becomes more important for very poor resolutions, 
or in higher-dimensionality spaces, in which the fraction of 
data points at the boundary is substantially larger. 

As previously mentioned, the probability distribution 
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Figure 7. Probability distribution p{p/p) for difJerent numbers 
of particles in the Hernquist sphere. The distribution is broader 
and more skewed as A'' decreases. 
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Figure 8. Comparison between FiEstAS (with Mk = 10) and 
SPH (using 32 and 64 neighbours) for a Hernquist sphere with 
A'' = 10'' particles. 

P{p/ p) does not tend to a Dirac delta function as A'^ ^ oo 
(see Fig. . If our point distribution is locally Poissonian, 
the dispersion can be reduced by averaging over a neigh- 
bouring volume containing more particles. The underlying 
density field must be approximately constant (i.e. we must 
have enough resolution) over the smoothing region. This 
assumption breaks down at the very centre, as well as in 
the outermost parts. In these regions, smoothing may even 
worsen the density estimate, but for most particles, it yields 
a significant improvement in acuracy (see Figs |3 and [SJ. 

It is encouraging that p{p/p) varies very little for A'^ > 
10'^ particles, suggesting that it has converged to its asymp- 
totic form. Our p{p/p) is well modelled by a log-normal dis- 
tribution with (log(/3//9)) 0.04 dex and a ^ 0.1 dex, which 
corresponds to a relative error Ap/p smaller than 26 per 
cent. 
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Figure 9. Phase-space density of two random realizations of 
a Hernquist model, with N = 100 (left) and TV = 10^ (right) 
particles. Line styles are the same as in Fig. |21 Thin solid lines 
represent equation llOi . 
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Figure 10. Dependence of /// on / for a Hernquist model 
sampled with various A'^. Smoothing and boundary corrections 
are both on. 

In Fig. m FiEstAS is compared with the popular SPH 
method. We set the smoothing length in the latter using 
32 and 64 neighbours. FiEstAS gives a slightly more ac- 
curate (i.e. lower scatter) estimate, and the results are sig- 
nificantly less biased towards large p/p. In order to obtain 
similar results, the SPH method would require averaging 
over more than 64 neighbouring particles, which is consider- 
ably demanding from the computational point of view, and 
seriously degrades the spatial resolution of the estimator. 

3.2 Phase space 

In this section, we test the performance of our algorithm 
in six-dimensional phase space. From the distribution func- 
tion IIOI I we can randomly sample the phase space with any 
number A'^ of particles. Fig. |U] shows the results for two re- 
alizations of a Hernquist model with different values of A'^. 
For A'' — 100, we are barely able to recover the qualitative 
behaviour of the theoretical profile, while with A'^ = 10^ the 
points scatter around the analytic value over eight orders of 
magnitude in /. 
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Figure 11. Probability distribution p{f/f) for a Hernquist 
model sampled with various N. Smoothing and boundary cor- 
rections are both on. 
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Figure 12. Dependence of /// on / for a Hernquist model of 
A'^ = 10* particles when various combinations of smoothing and 
boundary correction are used. 



The efltect of increasing the number of particles can be 
appreciated in Fig. 1101 More resolution extends the range 
of phase-space densities that are accurately estimated. As 
in the real-space case, lack of resolution leads to systematic 
underestimation of the highest densities, while there is a 
tendency to overestimate the lowest densities. Fig. llll shows 
that with N — 10* the probability distribution is rea- 

sonably unbiased, (log(///)^ ^ 0.04 dex, and has a typical 
spread a ~ 0.1 dex comparable to that in real space. Note 
however that more resolution is obviously needed in order 
to assess convergence. Comparison with the real-space den- 
sities shown in Figs. and [7| suggests that increasing the 
dimensionality of the space from 3 to 6 requires a factor of 
order 10 more points to achieve a similar distribution. In 
general, it seems reasonable to expect that the number of 
points needed to properly sample a given space increases ex- 
ponentially with the number of dimensions d in a way similar 
to the typical number of neighbours, roughly 2'^ . 

We quantify in Fig. ll2l the importance of smoothing and 
boundary corrections. In six dimensions, a very large frac- 
tion of particles lie close to at least one of the 12 faces of the 
bounding hyper-box. The correction for boundary effects is 
thus relevant even for A'^ — 10*. Using the smoothed estima- 
tor considerably reduces the dispersion in log(///), but / 
is still biased low when boundary effects are not accounted 
for. There are so many particles whose initial density (|5J was 
underestimated that they make a significant contribution to 
the volume integral in JSJ. 

Another effect of the coarser sampling in six dimensions 
is that the softening of the cusp is more noticeable than in 
the three-dimensional case. Note also that the slope of the 
phase-space density for _E — > is much steeper than the 
physical density at low r, which makes it difficult to sample 
the high-/ region adequately. 

None the less, the smoothed estimator @ greatly re- 
duces the spread in ///. As can be seen in Fig. ^| the un- 
smoothed estimator yields a dispersion of about one decade 
around the true value of /, regardless of whether the bound- 
ary effects have been corrected or not. Such a large disper- 



Smooth=ON 

Bndry=ON 

Smooth=ON 

Bndry=OFF 

Smooth = OFF 

Bndry=ON 

Smooth^OFF 

Bridry=OFF 




X ^log(f/f) 

Figure 13. Probability distribution p(///) for a Hernquist 
model of N = 10* particles when various combinations of smooth- 
ing and boundary correction are used. 



sion coinc ides approx i matel y with the statistical error re- 
ported bv lArad et aP J2004l) . 

Arad et al. investigated the volume distribution func- 
tion 



^'(/o)= / 5[f{x,v)-fo] d'xd^'v, 



(14) 



which gives the volume of phase space in which / lies in 
the interval (/o, fo + df), i.e. dV = v{f) df. Since the mass 
contained in particles that have phase space densities within 
df of / is 

^ = (15) 
we can estimate v{f) from dm/df. We obtain the latter by 
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Figure 14. Volume distribution v{f) for Hernquist models with 
N = 100 and N = 10^. The analytic result is plotted as a 
dotted line. 

sorting the particles in order of increasing /-values and then 
binning them. In our tests the bins contain 50 particles each, 
except in the case A'^ = 100, when we take only 5 particles 
per bin. 

lArad et alJ i2004l) show that errors in / cause the esti- 
mated volume fimction {;(/) to be the convolution 

/"OO 

«(/o)= / vif)pifo/f)r'df. (16) 
Jo 

For v{f) oc /~", we find that v{fo) oc f^", where the con- 
stant of proportionality is an integral that is independent 
of fo- From this result it follows that when v{f) is well ap- 
proximated locally by a power law over the range in which 
p{f/fo) is non-negligible, the measured logarithmic slope of 
v{f) at fo will equal the logarithmic slope of the underlying 
function «(/). 

When / is a function f{E) of energy alone, we obtain 
the theoretical volume distribution from 

vlfiE)] = (17) 

where g{E) is the density of states and f'{E) denotes the 
derivative of / with respect to E. For a Hernquist profile, / 
is given by equation IjlOjl . and the density of states is 

g{E) = Sqs [ 3(8g^ - 4g V 1) cos ^ g 

-g(l-g^)^/^(4g^-l)(2gV3) ]. (18) 

Fig. 1141 demonstrates that it is possible to recover the 
qualitative behaviour of v{f) even with a resolution as poor 
as N = 100. Deviations from the theoretical volume distri- 
bution are plotted in Fig. llSl for difi'erent numbers of parti- 
cles. Results for N ^ 1000 stay within a factor of 2 of the 
true v{f), although some systematic errors are present near 
the turnover of f{E), where the distribution function cannot 
be described as a power law. 



4 COSMOLOGICAL SIMULATION 

Having assessed the accuracy of our algorithm, we apply 
it to a realistic case in which the phase-space structure is 
not known a priori. We select a ~ lO^^M© halo from an 
N-body simulati on accomplished with the publicly-available 
code Gadget iSoringel et alJ I2001I : ISoringel fc Hernauisl 
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Figure 15. Systematic dependence of ii/v on / for Hernquist 
models with various N. 
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Figure 16. Density profile, both in real (left) and phase (right) 
space, of a 1O^*M0 halo in a numerical simulation (N ~ 10^). 
Solid lines show a Hernquist model with M = 3.6 X IO^^Mq and 
a = 250 kpc. In phase space, v-^ = 3cr^ has been assumed. Dashed 
lines show = and = 9(t^. 

l2002h . The object contains iV = 1 190 016 dark matter par- 
ticles, embedded in a 80 Mpc box in a ACDM universe 
(Odm = 0.3, Qa = 0.7, h = 0.7). A thorough anal ysis of 
these numer ical experiments can be found in Ascasiba r et alJ 
(I200II200I. 

A scatter plot of the real- and phase-space density of 
2 per cent of the particles is shown in Fig. 1161 Substruc- 
ture gives rise to localized density peaks in both spaces, 
whereas most of the mass follows the global trend of de- 
creasing density with radius. This component can be roughly 
described by a Hernquist model with M = 3.6 x 10^'* M© and 
a = 250 kpc, plotted as solid lines in the figure. 

Satellite haloes achieve higher central densities than the 
main object, and the effect is more pronounced in phase- 
space. Therefore, the high end of the volume distribution 
v{f) will be dominated by substructure. We plot this quan- 
tity in Fig. 1171 togeth er wit h the power-law behaviour advo- 
cated bv lArad et all ((200^. Although our results are con- 
sistent with «(/) oc f~^'^ over 4 decades in /, we find de- 
partures from this law both at low an d high phase-spac e 
densities. Similar results are reported bv lArad et al.l i2004l) . 
but deviations from a pure power law are attributed to a lack 
of resolution at high / and incomplete virialization at low 
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Figure 17. Volume distribution of pliase-space density for a 
simulated halo (solid line). Dotted lines plot Hernquist models 
of different masses, while the dashed line shows ^'(/) ex f~^'^. 
Vertical lines mark the expected turnovers in v{f), and arrows 
the resolution limits of this simulation (see text). 



/. We show below that such deviations are indeed expected 
theoretically for systems in virial equilibrium. 

Also shown in Fig. 1171 are the volume distributions of 
several Hernquist models with different masses. We use the 
same 'concentration' c = r2oo /rs = 8 for all of them, where 
r2oo is the radius at which the enclosed density is 200 times 
the critical density and Ts = a/2 is the point at which the 
logarithmic slope of the density profile is equal to —2. With 
this prescription, the mass and scale length are related by 



-200p.^f^ 
3 ^ 2 V2 



+ 1 



(19) 



Note that the volume distribution of the Hernquist 
model tends asymptotically to u(/ 0) (x f~^'^^ and 
v{f^cio) oc f~^'^. The turning point occurs roughly at 



/ ~ 3.25 X 10'** ( ) M0Mpc-^(kms 



and 

V ~ 5.46 X 10-'* (^^^ Mq' Mpc''( kms-')^ 

where m is the mass of the halo. 

In the toy model proposed by lArad et alJ i200' 
total volume distribution is given by the convolution 



v{.f) = 



dn(m) 

Vm.(J) — 5 dm 

dm 



(20) 



(21) 



the 



(22) 



of the individual Vm{f) of each sub-halo with the su b-halo 
mass function dn/dm oc m~^'^ ijPe Lucia et alJl2004 ). It is 
important, though, that the mass m of the satellites cannot 
reach an arbitrary value. On one hand, the resolution of the 
simulation imposes a minimum mass mmin ~ lOOmp. On the 
other hand, there is the physical constraint that m must be 
smaller than the mass of the main object. 
Let us define 



^ 3.25 X 10i8M|Mpc-3(kms-i)-3 ^^^^ 

and approximate Vm{f) as a double power law 

, / 5.46 X 10-'*m'(m<^)-'-^'^ for (f) ^ 1/m , . 
^rnU) ^ I g^^g ^ 10-38m3(m,^)-2-« for ^ 1/m. ^ ' 

We can evaluate expression as 

f i\ J.-1'56 / -0.46. I J.-2.8 

v{j) OC m dm + (p 



m '■'^dm 



3.28 ( 



'"•min 1-1.56 



0.54 ^ 

for mmin <l/(t) < mmax, and 



' "max , 

0.7 ^ 



v{f) oc 



dmm 



l.l+A 



(25) 



(26) 



with A = — 1.56 for < 1/mmax and A = — 2.8 for (j) > 1/mmin. 

According to this model, the logarithmic slope of the 
total volume distribution would vary smoothly from —1.56 
to —2.8 as (p goes from l/mmax to l/mmin, the shape of 
i;(/) being controlled by the values of these parameters. A 
constant power law v{f) oc f^^'^ can only be obtained in 
the limit mmin — > and mmax — > oo. 

The phase-space densities implied by mmin = 3 x 



10^ 



'Mq and mmax = 3.6 x 10' 



Mq are illustrated as verti- 



cal dotted lines in Fig. 1171 We find strong evidence of the 
turnover of v{f) at the low end. The logarithmic slope mea- 
sured by FiEstAS is fiatter than —2.1 over several orders of 
magnitude, and it is close to —1.56 at the physical scale dic- 
tated by the main halo. At the high end, the effects of mmin 
are (not surprisingly) very close to our resolution limit. 

Dynamical discreteness effects place an upper limit on 
the phase-space density at which a given cosmological simu- 
lation is trustworthy. These effects include tho se of granular - 
ity when the first non-lin ear structures form ^|Binnev^l2004^ 
and two-body relaxation jPiemand et al.ll2004l) . 

The phase-space density above which the two-body re- 
laxation time is shorter than the age of the universe is 

f 0-34 1 , . 

(2^)3/2G2 1nAmpto' ^ ' 

For the simulation studied here, m 
13Gyr. Substituting InA ~ 6, 



3 X 10** Mq and to 



= 4.8 X lO^M0Mpc-'(kms-')" 



(28) 



The phase-space density at which the effects discussed 
by Binney become important is 

?7mp (»mPc)^ , . 



H^mp 



{HoAq^y 

where Aq is the comoving interparticle separation and 77 
is a factor a little smaller than unity that depends on the 
precise time at which the first structures become non-linear. 
Therefore, 



/dis 



' 2.3 X IO^Mq Mpc~'(kms-')" 



(30) 



Both estimates yield similar values of the maximum 
phase-space density that can be reliably represented in an 
N-body si mulation (for a detailed discussion, the reader is 
referred to lBinnevll2004l) . The steepening of the volume dis- 
tribution at the resolution scale is a numerical artefact of the 
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simulation, not of our algorithm. The fact that it is detected 
and accurately measured suggests that smoothing and sam- 
pling errors in FiEstAS are significant only at the very large 
values of / at which the underlying simulation is dominated 
by discreteness effects. 

The situation at the low end is far more complicated. 
Although the simple model we have considered in this sec- 
tion offers an interesting insight on the problem, it does 
not yield a reliable prediction of the total volume distri- 
bution. The phase-space volumes occupied by the satellites 
may intersect amongst themselves, and they obviously do 
with the phase-space volume occupied by the main object. 
Therefore, the total v{f) of the system is much less than 
the sum over the individual sub-haloes, as equation 12211 as- 
sumes. The phase-space density fm{r,v) is additive, but the 
volume distribution Vm{f) is not. 

Deriving the shape and normalisation of v{f) would re- 
quire knowledge of the exact distribution of haloes in phase 
space. Qualitatively, we expect that most satellites are lo- 
cated in the low-/ regions of the main halo. The phase-space 
density will increase in these regions, most likely flattening 
v{f) at the low end. Hence, logarithmic slopes even flatter 
than —1.56 would not be unexpected on physical grounds, 
although there are also many numerical effects that con- 
tribute in the same direction. Some of them arise from the N- 
body technique (e.g. the limited number of high-resolution 
particles) and some of them are associated with FiEstAS 
(e.g. systematic bias at low /). ft is difficult to disentangle 
these numerical artefacts from the genuine effects of sub- 
structure. Although we cannot establish a definite value for 
the asymptotic behaviour of «(/) as / ^ 0, we do conclude 
that it is probably fiatter than a substructureless Hernquist 
sphere, «(/ 0) oc /"^'^^ 



5 PERFORMANCE 

State-of-the-art N-body simulations currently yield halos 
with up to A'' ~ 10^ particles. Recent observational archives 
achieve a similar number of objects. However, it seems likely 
that the typical number of points in future astrophysical 
datasets will increas e by orders of magn itude during the 
coming years (see e.g. lSzalav fc Gravl200l)) . Therefore, com- 
putational efficiency has become a crucial issue in data anal- 
ysis. A very accurate algorithm is completely useless if it 
cannot be run in a reasonable amount of time. 

Fig. llBl shows. as a function of the number of points A'', 
the time required to recover densities on a garden-variety 
laptop PC (1.7 GHz, 1 Gb RAM). The open symbols show 
the time required to estimate the real-space density, while 
the filled symbols are for recovery of the phase-space density. 
Our algorithm scales as Alog(A), shown as dotted lines, for 
N J5 10'' particles. The offset between real and phase space 
estimators arises because the typical number of neighbours 
a particle has rises with dimensionality d roughly like 2^*. 

It took FiEstAS 730 s to recover / fr om a set of 
A^ = 10® particles, whereas lArad et al] (|2004') report that 
about a week is required with a Delaunay tessellation. In fact 
the contrast in speed is even more dramatic than this com- 
parison would suggest, since nearly all the measured time 
for FiEstAS is given to smoothing: for N = 10® the cre- 
ation of the binary tree requires only ~ 4s independent of 



log(K) 

Figure 18. Computing time versus particle number for the re- 
covery of real-space density (open symbols) and phase-space den- 
sity (filled symbols). Dotted lines mark Nlog{N) scaling. 



the number of dimensions. The density estimates of Arad 
et al. are unsmoothed ones and show the same large statis- 
tical fluctuations as our expression ||2J. Hence the week of 
computational time reported by Arad et al. ought strictly 
to be compared to the 4 s it takes FiEstAS to produce un- 
smoothed density estimates. 



6 CONCLUSIONS 

In almost every branch of the physical, biological and social 
sciences one frequently wishes to determine the density of 
data points in some space. As all points in a given cluster 
are expected to have a common origin, identifying clusters 
through maxima in the density of points is likely to yield 
clues as to what processes are responsible for distributing 
points in the space. In the general case, each axis will rep- 
resent a quantity with its own particular dimensions, and 
there is no natural deflnition of the distance between two 
data points. 

The Field Estimator for Arbitrary Spaces (FiEstAS) 
that has been presented here provides a fast and accurate 
way of determining the density fleld underlying any given 
distribution of points. FiEstAS obtains a flrst estimate of 
the density by associating a volume with each data point and 
taking the density to be the inverse of that volume. Such an 
estimate inevitably fluctuates by of order 1 dex regardless 
of the number of data points, so it is desirable to smooth 
it. The computational cost of FiEstAS increases with the 
number of data points N as Alog(A), and approximately 
as 2^* (probably d x 2^*) with dimensionality d. For d < 6 and 
N < 10^, smoothed densities within ~ 0.1 dex error bars (as 
long as the field is well sampled) are easily recovered on a 
single-processor computer. 

The higher the dimension of the space within which 
densities are required, the greater is the fraction of the par- 
ticles that lie near the boundary of the sampled volume. The 
determination of the density at the locations of these par- 
ticles is problematical for any method because it depends 
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on what one takes to be the boundary of the sampled vol- 
ume. We have experimented with various schemes for defin- 
ing this boundary, without arriving at a definitive solution 
of the problem. 

In tests with d = 3, we find that FiEstAS returns 
almost unbiased densities with 10^ points in a Ifernquist 
sphere. Statistical errors are ~ 0.1 dex over ten orders of 
magnitude in p. Smoothing tends to overestimate the den- 
sity in low-density regions, while the opposite is true for the 
central, high-density region. Increasing the particle number 
extends the range of densities over which FiEstAS makes 
an unbiased estimate, but does not alter the magnitude of 
the statistical fiuctuations in p. Our tests show that density 
estimates obtained with the popular SPH kernel are signifi- 
cantly more strongly biased (towards large values) than are 
estimates from FiEstAS. 

Most popular density estimators are not well suited to 
the evaluation of phase-space densities / because they re- 
quire the distance between any two points to be defined and 
there is no natural definition of distance in phase space. 
Since FiEstAS does not involve the concept of distance, it 
is well suited to the estimation of /. A possible drawback of 
FiEstAS is its reliance on a coordinate grid, so density es- 
timates will vary slightly wh en the grid i s rotat ed. We con- 
firm the results obtained by lArad et alJ (|20o3) for a dark 
halo from a cosmological simulation, but at greatly reduced 
computational cost: the CPU time required is reduced from 
of order a week for raw densities to less than fifteen minutes 
for smoothed densities. This speedup makes feasible the de- 
termination of phase-space densities in simulations with up 
to 10^ particles. 

Although we confirm that over three to five decades in 
/ the specific volume function follows a power law v{f) oc 
f~^'^, we find that at both the lowest and the highest phase- 
space densities, v{f) falls below this power law, and we be- 
lieve this phenomenon is not an artefact of our analysis. We 
are able to account for the form of v{f) with a model in 
which a massive CDM halo is a superposition of a popu- 
lation of Hernquist profiles. A similar model was originally 
proposed by Arad et al., but in their analysis it did not re- 
produce v{f) because they failed to recognize that in any 
given simulation there is both a smallest and a largest halo 
that can form. 

The interest of FiEstAS extends far beyond the mere 
estimation of densities; it opens up a world of possibilities for 
both supervised and unsupervised classification. As an ex- 
ample of the latter, we are currently developing a halo finder 
in phase space. It has the advantage over conventional, real- 
space based methods that sub-haloes and tidal streams are 
very clearly defined in phase space that are not apparent 
in real space. Another possibility would be the identifica- 
tion of galaxy groups and clusters from observational data. 
Not being based on a particular metric, we expect FiEstAS 
to provide a robust parameter-free method that could ex- 
ploit all the available information, such as coordinates on 
the sky, redshifts, colours, etc. Indeed, FiEstAS constitutes 
a general-purpose data-mining tool with many potential ap- 
plications, not necessarily restricted to the field of Astro- 
physics. 
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